Random utility theory models an agent's preferences on alternatives by drawing
a real-valued score on each alternative (typically independently) from a parameterized
distribution, and then ranking the alternatives according to scores. A
special case that has received significant attention is the Plackett-Luce model, for
which fast inference methods for maximum likelihood estimators are available.
This paper develops conditions on general random utility models that enable fast
inference within a Bayesian framework through MC-EM, providing concave log-likelihood
functions and bounded sets of global maxima solutions. Results on
both real-world and simulated data provide support for the scalability of the approach
and capability for model selection among general random utility models
including Plackett-Luce.

1 Introduction 

Problems of learning with rank-based error metrics [16] and the adoption of learning for the purpose
of rank aggregation in social choice [7, 8, 23, 25, 29, 30] have been gaining prominence in recent years.
In part, this is due to the explosion of socio-economic platforms, where opinions of users need to be 
aggregated; e.g., judges in crowd-sourcing contests, ranking of movies or user-generated content. 

In the problem of social choice, users submit ordinal preferences consisting of partial or total ranks 
on the alternatives and a single rank order must be selected to be representative of the reports. 
Since Condorcet [6], one approach to this problem is to formulate social choice as the problem 
of estimating a true underlying world state (e.g., a true quality ranking of alternatives), where the 
individual reports are viewed as noisy data in regard to the true state. In this way, social choice can 
be framed as a problem of inference. 

In particular, Condorcet assumed the existence of a true ranking over alternatives, with a voter's pref- 
erence between any pair of alternatives a, b generated to agree with the true ranking with probability 
p > 1/2 and disagree otherwise. Condorcet proposed to choose as the outcome of social choice the 
ranking that maximizes the likelihood of observing the voters' preferences. Later, Kemeny's rule 
was shown to provide the maximum likelihood estimator (MLE) for this model [32].
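The maximum likelihood reading of Kemeny's rule can be illustrated with a small brute-force sketch. Under Condorcet's model with independent pairwise comparisons and p > 1/2, a candidate ranking's likelihood grows with the number of pairwise agreements with the votes, so maximizing it amounts to minimizing the total number of pairwise disagreements. The function names below are hypothetical, and enumerating all permutations is only feasible for a handful of alternatives.

```python
from itertools import combinations, permutations

def kendall_distance(a, b):
    """Number of pairs of alternatives that rankings a and b order differently."""
    pos_b = {alt: r for r, alt in enumerate(b)}
    # combinations(a, 2) yields (x, y) with x ranked above y in a.
    return sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])

def kemeny_ranking(profile):
    """Brute-force Kemeny ranking: minimize total Kendall distance to all votes."""
    alternatives = sorted(profile[0])
    return min(
        permutations(alternatives),
        key=lambda cand: sum(kendall_distance(cand, vote) for vote in profile),
    )

# Three voters over alternatives 0, 1, 2 (listed from most to least preferred).
votes = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]
print(kemeny_ranking(votes))
```

The factorial search space in this sketch is consistent with the hardness result cited next for the Kemeny rule.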



But Condorcet's probabilistic model assumes identical and independent distributions on pairwise 
comparisons. This ignores the strength in agents' preferences (the same probability p is adopted 
for all pairwise comparisons), and allows for cyclic preferences. In addition, computing the winner 
through the Kemeny rule is $\Theta_2^P$-complete [13].

To overcome the first criticism, a more recent literature adopts the random utility model (RUM)
from economics [26]. Consider $C = \{c_1, \ldots, c_m\}$ alternatives. In RUM, there is a ground truth
utility (or score) associated with each alternative. These are real-valued parameters, denoted by
$\vec{\theta} = (\theta_1, \ldots, \theta_m)$. Given this, an agent independently samples a random utility ($X_j$) for each
alternative $c_j$ with conditional distribution $\mu_j(\cdot \mid \theta_j)$.
Usually $\theta_j$ is the mean of $\mu_j(\cdot \mid \theta_j)$.^1 Let $\pi$ denote a permutation of $\{1, \ldots, m\}$, which naturally
corresponds to a linear order: $[c_{\pi(1)} \succ c_{\pi(2)} \succ \cdots \succ c_{\pi(m)}]$. Slightly abusing notation, we also use
$\pi$ to denote this linear order. Random utility $(X_1, \ldots, X_m)$ generates a distribution on preference
orders, as

$$\Pr(\pi \mid \vec{\theta}) = \Pr(X_{\pi(1)} > X_{\pi(2)} > \cdots > X_{\pi(m)}) \qquad (1)$$

The generative process is illustrated in Figure 1.

Figure 1: The generative process for RUMs.



Adopting RUMs rules out cyclic preferences, because each agent's outcome corresponds to an order
on real numbers, and it also captures the strength of preference, and thus overcomes the second
criticism, by assigning a different parameter ($\theta_j$) to each alternative.
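The generative process can be sketched in a few lines. The example below assumes normally distributed utilities with a common standard deviation `sigma`; the function names, the 0-indexed alternatives, and the seed handling are illustrative choices rather than anything specified in the paper.

```python
import random

def sample_ranking(theta, sigma=1.0, rng=None):
    """Draw one ranking from a normal RUM: X_j ~ N(theta_j, sigma^2), independently."""
    rng = rng or random.Random()
    utilities = [rng.gauss(t, sigma) for t in theta]
    # Alternatives (0-indexed) sorted from highest to lowest sampled utility.
    return sorted(range(len(theta)), key=lambda j: -utilities[j])

def sample_profile(theta, n, sigma=1.0, seed=0):
    """Draw a preference profile of n independent rankings."""
    rng = random.Random(seed)
    return [sample_ranking(theta, sigma, rng) for _ in range(n)]

profile = sample_profile([3.0, 0.0, -3.0], n=5)
print(profile)
```

Because every ranking is obtained by sorting real numbers, each sampled preference is a strict linear order whenever the utilities are distinct, which is the sense in which RUMs exclude cycles.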



A popular RUM is Plackett-Luce (P-L) [18, 21], where the random utility terms are generated according
to Gumbel distributions with fixed shape parameter [2, 31]. For P-L, the likelihood function
has a simple analytical solution, making MLE inference tractable. P-L has been extensively applied
in econometrics [1, 19], and more recently in machine learning and information retrieval (see [16]
for an overview). Efficient methods of EM inference [5, 14], and more recently expectation propagation [12],
have been developed for P-L and its variants.



In application to social choice, the P-L model has been used to analyze political elections [10]. The EM
algorithm has also been used to learn the Mallows model, which is closely related to Condorcet's
probabilistic model [17].

Although P-L overcomes the two difficulties of the Condorcet-Kemeny approach, it is still quite
restricted, by assuming that the random utility terms are distributed as Gumbel, with each alternative
characterized by one parameter, which is the mean of its corresponding distribution. In fact, little
is known about inference in RUMs beyond P-L. Specifically, we are not aware of either an analytical
solution or an efficient algorithm for MLE inference for one of the most natural models proposed by
Thurstone [26], where each $X_j$ is normally distributed.



1.1 Our Contributions 

In this paper we focus on RUMs in which the random utilities are independently generated with
respect to distributions in the exponential family (EF) [20]. This extends the P-L model, since
the Gumbel distribution with fixed shape parameters belongs to the EF. Our main theoretical
contributions are Theorem 1 and Theorem 2, which propose conditions such that the log-likelihood
function is concave and the set of global maxima solutions is bounded for the location family, which
are RUMs where the shape of each distribution $\mu_j$ is fixed and the only latent variables are the
locations, i.e., the means of the $\mu_j$'s. These results hold for existing special cases, such as the P-L
model, and many other RUMs, for example the ones where each $\mu_j$ is chosen from Normal, Gumbel,
Laplace and Cauchy.



1 $\mu_j(\cdot \mid \theta_j)$ might be parameterized by other parameters, for example variance.
We also propose a novel application of MC-EM. We treat the random utilities ($\vec{X}$) as latent variables,
and adopt the Expectation Maximization (EM) method to estimate parameters $\vec{\theta}$. The E-step for
this problem is not analytically tractable, and for this we adopt a Monte Carlo approximation. We
establish through experiments that the Monte-Carlo error in the E-step is controllable and does not
affect inference, as long as numerical parameterizations are chosen carefully. In addition, for the E-step
we suggest a parallelization over the agents and alternatives and a Rao-Blackwellized method,
which further increase the scalability of our method.

We generally assume that the data provides total orders on alternatives from voters, but comment on 
how to extend the method and theory to the case where the input preferences are partial orders. 

We evaluate our approach on synthetic data as well as two real-world datasets, a public election 
dataset and one involving rank preferences on sushi. The experimental results suggest that the 
approach is scalable despite providing significantly improved modeling flexibility over existing ap- 
proaches. 

For the two real-world datasets we have studied, we compare RUMs with normal distributions and 
P-L in terms of four criteria: log-likelihood, predictive log-likelihood, Akaike information criterion 
(AIC), and Bayesian information criterion (BIC). We observe that when the amount of data is not 
too small, RUMs with normal distributions fit better than P-L. Specifically, for the log-likelihood, 
predictive log-likelihood, and AIC criteria, RUMs with normal distributions outperform P-L with 
95% confidence in both datasets. 

2 RUMs and Exponential Families 

In social choice, each agent $i \in \{1, \ldots, n\}$ has a strict preference order on alternatives. This
provides the data for an inferential approach to social choice. In particular, let $L(C)$ denote the set
of all linear orders on $C$. Then, a preference-profile, $D$, is a set of $n$ preference orders, one from
each agent, so that $D \in L(C)^n$.

A voting rule $r$ is a mapping that assigns to each preference-profile a set of winning rankings,
$r : L(C)^n \to (2^{L(C)} \setminus \emptyset)$. In particular, in the case of ties the set of winning rankings may include
more than a singleton ranking. In the maximum likelihood (MLE) approach to social choice, the
preference profile is viewed as data, $D = \{\pi^1, \ldots, \pi^n\}$.

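Given this, the probability (likelihood) of the data given ground truth $\vec{\theta}$ (and for a particular $\vec{\mu}$) is
$\Pr(D \mid \vec{\theta}) = \prod_{i=1}^{n} \Pr(\pi^i \mid \vec{\theta})$, where,

$$\Pr(\pi \mid \vec{\theta}) = \int_{x_{\pi(m)}=-\infty}^{\infty} \int_{x_{\pi(m-1)}=x_{\pi(m)}}^{\infty} \cdots \int_{x_{\pi(1)}=x_{\pi(2)}}^{\infty} \mu_{\pi(m)}(x_{\pi(m)}) \cdots \mu_{\pi(1)}(x_{\pi(1)}) \, dx_{\pi(1)} \, dx_{\pi(2)} \cdots dx_{\pi(m)} \qquad (2)$$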

The MLE approach to social choice selects as the winning ranking that which corresponds to the $\vec{\theta}$
that maximizes $\Pr(D \mid \vec{\theta})$. In the case of multiple parameters that maximize the likelihood, the
MLE approach returns a set of rankings, one ranking corresponding to each parameterization.

In this paper, we focus on probabilistic models where each $\mu_j$ belongs to the exponential family
(EF). The density function for each $\mu$ in EF has the following format:

$$\Pr(X = x) = \mu(x) = e^{\eta(\theta) T(x) - A(\theta) + B(x)}, \qquad (3)$$

where $\eta(\cdot)$ and $A(\cdot)$ are functions of $\theta$, $B(\cdot)$ is a function of $x$, and $T(x)$ denotes the sufficient
statistics for $x$, which could be multidimensional.

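Example 1 (Plackett-Luce as an RUM [2]) In the RUM, let the $\mu_j$'s be Gumbel distributions. That
is, for alternative $j \in \{1, \ldots, m\}$ we have $\mu_j(x_j \mid \theta_j) = e^{-(x_j - \theta_j)} e^{-e^{-(x_j - \theta_j)}}$. Then, we have:
$\Pr(\pi \mid \vec{\lambda}) = \Pr(x_{\pi(1)} > x_{\pi(2)} > \cdots > x_{\pi(m)}) = \prod_{j=1}^{m} \frac{\lambda_{\pi(j)}}{\sum_{j'=j}^{m} \lambda_{\pi(j')}}$, where $\eta(\theta_j) = \lambda_j = e^{\theta_j}$,
$T(x_j) = -e^{-x_j}$, $B(x_j) = -x_j$ and $A(\theta_j) = -\theta_j$. This gives us the Plackett-Luce model.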
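The correspondence in Example 1 can be checked numerically. The sketch below evaluates the Plackett-Luce product formula with $\lambda_j = e^{\theta_j}$ and compares it against the empirical frequency of a ranking when utilities are drawn from Gumbel distributions with location $\theta_j$. Function names, the trial count, and the small clipping of the uniform draw (to avoid taking the logarithm of zero) are assumptions made for illustration.

```python
import math
import random

def plackett_luce_prob(ranking, theta):
    """Pr(ranking | theta) under Plackett-Luce with lambda_j = exp(theta_j)."""
    lam = [math.exp(t) for t in theta]
    remaining = sum(lam[j] for j in ranking)
    prob = 1.0
    for j in ranking:
        prob *= lam[j] / remaining  # chance that j is picked first among the rest
        remaining -= lam[j]
    return prob

def gumbel_rum_frequency(ranking, theta, trials=20000, seed=0):
    """Empirical frequency of `ranking` when X_j = theta_j + standard Gumbel noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Inverse-CDF draw: theta_j - log(-log(u)), with u clipped away from 0 and 1.
        x = [t - math.log(-math.log(rng.uniform(1e-12, 1 - 1e-12))) for t in theta]
        if sorted(range(len(theta)), key=lambda j: -x[j]) == list(ranking):
            hits += 1
    return hits / trials

theta = [1.0, 0.0, -1.0]
print(plackett_luce_prob([0, 1, 2], theta), gumbel_rum_frequency([0, 1, 2], theta))
```

The two printed numbers should be close, with the gap shrinking as the number of trials grows.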

3 Global Optimality and Log-Concavity

In this section, we provide a condition on distributions that guarantees that the likelihood function (2)
is log-concave in parameters $\vec{\theta}$. We also provide a condition under which the set of MLE solutions
is bounded when any one latent parameter is fixed. Together, this guarantees the convergence of our
MC-EM approach to a global mode with an accurate enough E-step. We focus on the location family,
which is a subset of RUMs where the shapes of all $\mu_j$'s are fixed, and the only parameters are the
means of the distributions. For the location family, we can write $X_j = \theta_j + \zeta_j$, where $X_j \sim \mu_j(\cdot \mid \theta_j)$
and $\zeta_j = X_j - \theta_j$ is a random variable whose mean is 0 and models an agent's subjective noise.
The random variables $\zeta_j$ do not need to be identically distributed for all alternatives $j$; e.g., they
can be normal with different fixed variances.

We focus on computing solutions ($\vec{\theta}$) to maximize the log-likelihood function,

$$l(\vec{\theta}; D) = \sum_{i=1}^{n} \log \Pr(\pi^i \mid \vec{\theta}) \qquad (4)$$

Theorem 1 For the location family, if for every $j \le m$ the probability density function for $\zeta_j$ is
log-concave, then $l(\vec{\theta}; D)$ is concave.

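Proof sketch: The theorem is proved by applying the following lemma, which is Theorem 9 in [22].

Lemma 1 Suppose $g_1(\vec{\theta}, \vec{\zeta}), \ldots, g_R(\vec{\theta}, \vec{\zeta})$ are concave functions in $\mathbb{R}^{2m}$ where $\vec{\theta}$ is the vector of $m$
parameters and $\vec{\zeta}$ is a vector of $m$ real numbers that are generated according to a distribution whose
pdf is logarithmic concave in $\mathbb{R}^{m}$. Then the following function is log-concave in $\mathbb{R}^{m}$.

$$L_i(\vec{\theta}, G) = \Pr(g_1(\vec{\theta}, \vec{\zeta}) \ge 0, \ldots, g_R(\vec{\theta}, \vec{\zeta}) \ge 0), \quad \vec{\theta} \in \mathbb{R}^{m} \qquad (5)$$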

To apply Lemma 1, we define a set $G^i$ of functions $g^i_r$ that is equivalent to an order $\pi^i$ in the sense of
inequalities implied by RUM for $\pi^i$ and $G^i$ (the joint probability in (5) for $G^i$ is to be the same as the
probability of $\pi^i$ in RUM with parameters $\vec{\theta}$). Suppose $g^i_r(\vec{\theta}, \vec{\zeta}) = \theta_{\pi^i(r)} + \zeta^i_{\pi^i(r)} - \theta_{\pi^i(r+1)} - \zeta^i_{\pi^i(r+1)}$
for $r = 1, \ldots, m - 1$.

Then considering that the length of order $\pi^i$ is $R + 1$, we have:

$$L_i(\vec{\theta}, \pi^i) = L_i(\vec{\theta}, G^i) = \Pr(g^i_1(\vec{\theta}, \vec{\zeta}) \ge 0, \ldots, g^i_R(\vec{\theta}, \vec{\zeta}) \ge 0), \quad \vec{\theta} \in \mathbb{R}^{m} \qquad (6)$$

This is because $g^i_r(\vec{\theta}, \vec{\zeta}) \ge 0$ is equivalent to alternative $\pi^i(r)$ being preferred to alternative
$\pi^i(r + 1)$ in $\pi^i$, in the RUM sense.

To see how this extends to the case where preferences are specified as partial orders, we consider
in particular an interpretation where an agent's report for the ranking of $m_i$ alternatives implies that
all other alternatives are worse for the agent, in some undefined order. Given this, define
$g^i_r(\vec{\theta}, \vec{\zeta}) = \theta_{\pi^i(r)} + \zeta^i_{\pi^i(r)} - \theta_{\pi^i(r+1)} - \zeta^i_{\pi^i(r+1)}$ for $r = 1, \ldots, m_i - 1$ and
$g^i_r(\vec{\theta}, \vec{\zeta}) = \theta_{\pi^i(m_i)} + \zeta^i_{\pi^i(m_i)} - \theta_{\pi^i(r+1)} - \zeta^i_{\pi^i(r+1)}$ for $r = m_i, \ldots, m - 1$.
Considering that the $g^i_r(\cdot)$'s are linear (hence, concave) and
using log-concavity of the distributions of the $\vec{\zeta}^i = (\zeta^i_1, \ldots, \zeta^i_m)$'s, we can apply Lemma 1 and prove
log-concavity of the likelihood function. □

It is not hard to verify that pdfs for normal and Gumbel are log-concave under reasonable conditions 
for their parameters, made explicit in the following corollary. 

Corollary 1 For the location family where each $\zeta_j$ is a normal distribution with mean zero and
with fixed variance, or a Gumbel distribution with mean zero and fixed shape parameter, $l(\vec{\theta}; D)$ is
concave. Specifically, the log-likelihood function for P-L is concave.

The concavity of the log-likelihood of P-L has been proved [9] using a different technique. Using Fact
3.5 in [24], the set of global maxima solutions to the likelihood function, denoted by $S_D$, is convex
since the likelihood function is log-concave. However, we also need that $S_D$ is bounded, and would
further like that it provides one unique order as the estimation for the ground truth.

For P-L, Ford, Jr. [9] proposed the following necessary and sufficient condition for the set of global
maxima solutions to be bounded (more precisely, unique) when $\sum_{j=1}^{m} e^{\theta_j} = 1$.

Condition 1 Given the data $D$, in every partition of the alternatives $C$ into two nonempty subsets
$C_1 \cup C_2$, there exist $c_1 \in C_1$ and $c_2 \in C_2$ such that there is at least one ranking in $D$ where $c_1 \succ c_2$.
We next show that Condition 1 is also a necessary and sufficient condition for the set of global
maxima solutions $S_D$ to be bounded in location families, when we set one of the values $\theta_j$ to be 0
(w.l.o.g., let $\theta_1 = 0$). If we do not bound any parameter, then $S_D$ is unbounded, because for any $\vec{\theta}$,
any $D$, and any number $s \in \mathbb{R}$, $l(\vec{\theta}; D) = l(\vec{\theta} + s; D)$.

Theorem 2 Suppose we fix $\theta_1 = 0$. Then, the set $S_D$ of global maxima solutions to $l(\vec{\theta}; D)$ is
bounded if and only if the data $D$ satisfies Condition 1.

Proof sketch: 

If Condition 1 does not hold, then $S_D$ is unbounded because the parameters for all alternatives in
$C_1$ can be increased simultaneously to improve the log-likelihood. For sufficiency, we first present
the following lemma.

Lemma 2 If alternative $j$ is preferred to alternative $j'$ in at least one ranking, then the difference
of their mean parameters $\theta_{j'} - \theta_j$ is bounded from above ($\exists Q$ where $\theta_{j'} - \theta_j < Q$) for all the $\vec{\theta}$
that maximize the likelihood function.
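Proof: Suppose that $j \succ j'$ in rank $i$, then for any $\vec{\theta} \in \mathbb{R}^{m}$:

$$L_i(\vec{\theta}, \pi^i) = L_i(\vec{\theta}, G^i) = \Pr(g^i_1(\vec{\theta}, \vec{\zeta}) \ge 0, \ldots, g^i_R(\vec{\theta}, \vec{\zeta}) \ge 0)$$
$$\le \Pr(g^i_r(\vec{\theta}, \vec{\zeta}) \ge 0, g^i_{r+1}(\vec{\theta}, \vec{\zeta}) \ge 0, \ldots, g^i_{r'-1}(\vec{\theta}, \vec{\zeta}) \ge 0) \le \Pr(\zeta_j - \zeta_{j'} \ge \theta_{j'} - \theta_j), \qquad (7)$$

where $j = \pi^i(r)$ and $j' = \pi^i(r')$.

Let $K = l(\vec{0}; D)$. Since every term of the log-likelihood is at most 0, it follows that for any $\vec{\theta} \in S_D$
and any $i \le n$, $\log L_i(\vec{\theta}, \pi^i) \ge l(\vec{\theta}; D) \ge K$, that is, $L_i(\vec{\theta}, \pi^i) \ge e^{K}$.

Hence, $\Pr(\zeta_j - \zeta_{j'} \ge \theta_{j'} - \theta_j) \ge e^{K}$.

Therefore, there exists $K'$ such that $\theta_{j'} - \theta_j < K'$, where $K'$ depends on $K$ and the fixed distributions of $\zeta_{j'}$ and $\zeta_j$. □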

Now consider a directed graph $G_D$, where the nodes are the alternatives, and there is an edge from
$c_j$ to $c_{j'}$ if in at least one ranking $c_j \succ c_{j'}$. By Condition 1, for any pair $j \ne j'$, there is a path
from $c_j$ to $c_{j'}$ (and conversely, a path from $c_{j'}$ to $c_j$). To see this, consider building a path between
$j$ and $j'$ by starting from a partition with $C_1 = \{j\}$ and following an edge from $j$ to $j_1$ in the graph,
where $j_1$ is an alternative in $C_2$ for which there must be such an edge, by Condition 1. Consider the
partition with $C_1 = \{j, j_1\}$, and repeat until an edge can be followed to vertex $j' \in C_2$. It follows
from Lemma 2 that for any $\vec{\theta} \in S_D$ we have $|\theta_j - \theta_{j'}| \le Qm$, using the telescopic sum of bounded
values of the difference of mean parameters along the edges of the path, since the length of the path
is no more than $m$ (and tracing the path from $j$ to $j'$ and $j'$ to $j$), meaning that $S_D$ is bounded. □
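The graph argument above suggests a simple way to test Condition 1 on data: every nonempty proper subset of alternatives has an outgoing edge exactly when the directed graph is strongly connected. The sketch below builds the graph from full rankings and checks reachability from one node along forward and reversed edges. The function name and the 0-indexed encoding of alternatives are assumptions, and partial orders are not handled.

```python
def satisfies_condition_1(rankings, m):
    """Check Condition 1 by testing strong connectivity of the 'ranked above' graph."""
    succ = [set() for _ in range(m)]  # succ[j]: alternatives ranked below j somewhere
    pred = [set() for _ in range(m)]  # pred[j]: alternatives ranked above j somewhere
    for r in rankings:
        for a in range(len(r)):
            for b in range(a + 1, len(r)):
                succ[r[a]].add(r[b])
                pred[r[b]].add(r[a])

    def reaches_all(adj):
        # Depth-first search from alternative 0.
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == m

    # Strongly connected iff node 0 reaches every node and every node reaches node 0.
    return reaches_all(succ) and reaches_all(pred)

print(satisfies_condition_1([[0, 1, 2], [2, 1, 0]], 3))  # two opposite votes
print(satisfies_condition_1([[0, 1, 2]], 3))              # a single vote
```

In the second call no ranking places any alternative above alternative 0, so the partition with $C_2 = \{0\}$ violates the condition.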

Now that we have the log-concavity and bounded property, we need to declare conditions under
which the bounded convex space of estimated parameters corresponds to a unique order. The next
theorem provides a necessary and sufficient condition for all global maxima to correspond to the
same order on alternatives. Suppose that we order the alternatives based on estimated $\theta$'s (meaning
that $c_j$ is ranked higher than $c_{j'}$ iff $\theta_j > \theta_{j'}$).

Theorem 3 The order over parameters is strict and is the same across all $\vec{\theta} \in S_D$ if for all $\vec{\theta} \in S_D$
and all alternatives $j \ne j'$, $\theta_j \ne \theta_{j'}$.

Proof: Suppose for the sake of contradiction there exist two maxima, $\vec{\theta}, \vec{\theta}^* \in S_D$ and a pair of
alternatives $j \ne j'$ such that $\theta_j > \theta_{j'}$ and $\theta^*_{j'} > \theta^*_j$. Then, there exists an $0 < \alpha < 1$ such that the $j$th
and $j'$th components of $\alpha\vec{\theta} + (1 - \alpha)\vec{\theta}^*$ are equal. Since $S_D$ is convex, this point belongs to $S_D$, which contradicts the assumption. □

Hence, if there is never a tie in the scores in any $\vec{\theta} \in S_D$, then any vector in $S_D$ will reveal the
unique order.



4 Monte Carlo EM for Parameter Estimation 



In this section, we propose an MC-EM algorithm for MLE inference for RUMs where every $\mu_j$
belongs to the EF.^2

2 Our algorithm can be naturally extended to compute a maximum a posteriori probability (MAP) estimate,
when we have a prior over the parameters $\vec{\theta}$. Still, it seems hard to motivate the imposition of a prior on
parameters in many social choice domains.
The EM algorithm determines the MLE parameters $\vec{\theta}$ iteratively, and proceeds as follows. In each
iteration $t + 1$, given parameters $\vec{\theta}^t$ from the previous iteration, the algorithm is composed of an
E-step and an M-step. For the E-step, for any given $\vec{\theta} = (\theta_1, \ldots, \theta_m)$, we compute the conditional
expectation of the complete-data log-likelihood (latent variables $\vec{x}$ and data $D$), where the latent
variables $\vec{x}$ are distributed according to data $D$ and parameters $\vec{\theta}^t$ from the last iteration.

For the M-step, we optimize $\vec{\theta}$ to maximize the expected log-likelihood computed in the E-step, and
use it as the input $\vec{\theta}^{t+1}$ for the next iteration:

E-step: $Q(\vec{\theta}, \vec{\theta}^t) = E_{\vec{x}}\left\{ \log \prod_{i=1}^{n} \Pr(\vec{x}^i, \pi^i \mid \vec{\theta}) \,\middle|\, D, \vec{\theta}^t \right\}$

M-step: $\vec{\theta}^{t+1} \in \arg\max_{\vec{\theta}} Q(\vec{\theta}, \vec{\theta}^t)$



4.1 Monte Carlo E-step by Gibbs sampler 

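The E-step can be simplified using (3) as follows:

$$E_{\vec{x}}\left\{ \log \prod_{i=1}^{n} \Pr(\vec{x}^i, \pi^i \mid \vec{\theta}) \,\middle|\, D, \vec{\theta}^t \right\} = E_{\vec{x}}\left\{ \log \prod_{i=1}^{n} \Pr(\vec{x}^i \mid \vec{\theta}) \Pr(\pi^i \mid \vec{x}^i) \,\middle|\, D, \vec{\theta}^t \right\}$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{m} E_{x^i_j}\left\{ \log \mu_j(x^i_j \mid \theta_j) \mid \pi^i, \vec{\theta}^t \right\}$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{m} \left( \eta(\theta_j) E_{x^i_j}\{ T(x^i_j) \mid \pi^i, \vec{\theta}^t \} - A(\theta_j) + W \right),$$

where $W = E_{x^i_j}\{ B(x^i_j) \mid \pi^i, \vec{\theta}^t \}$ only depends on $\vec{\theta}^t$ and $D$ (not on $\vec{\theta}$), which means that it can
be treated as a constant in the M-step. Hence, in the E-step we only need to compute $S^{i,t+1}_j =
E_{x^i_j}\{ T(x^i_j) \mid \pi^i, \vec{\theta}^t \}$, where $T(x^i_j)$ is the sufficient statistic for the parameter $\theta_j$ in the model. We
are not aware of an analytical solution for $E_{x^i_j}\{ T(x^i_j) \mid \pi^i, \vec{\theta}^t \}$. However, we can use a Monte
Carlo approximation, which involves sampling $\vec{x}^i$ from the distribution $\Pr(\vec{x}^i \mid \pi^i, \vec{\theta}^t)$ using a Gibbs
sampler, and then approximates $S^{i,t+1}_j$ by $\frac{1}{N} \sum_{k=1}^{N} T(x^{i,k}_j)$, where $N$ is the number of samples in
the Gibbs sampler.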

In each step of our Gibbs sampler for voter $i$, we randomly choose a position $j$ in $\pi^i$ and
sample $x^i_{\pi^i(j)}$ according to a TruncatedEF distribution $\Pr(\cdot \mid x_{\pi^i(-j)}, \vec{\theta}^t, \pi^i)$, where $x_{\pi^i(-j)} =
(x_{\pi^i(1)}, \ldots, x_{\pi^i(j-1)}, x_{\pi^i(j+1)}, \ldots, x_{\pi^i(m)})$. The TruncatedEF is obtained by truncating the tails
of $\mu_{\pi^i(j)}(\cdot \mid \theta^t_{\pi^i(j)})$ at $x_{\pi^i(j-1)}$ and $x_{\pi^i(j+1)}$, respectively. For example, a truncated normal distribution
is illustrated in Figure 2.




Figure 2: A truncated normal distribution. 
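One Gibbs step of this kind can be sketched for the normal case using only the Python standard library. The sketch draws from a truncated normal by inverse-CDF sampling with `statistics.NormalDist`; the function names, the clipping constant, and the final clamping to the truncation bounds are assumptions made for numerical safety and are not taken from the paper. Inverse-CDF sampling loses accuracy when the truncation interval lies far in a tail.

```python
import random
from statistics import NormalDist

def sample_truncated_normal(mean, sigma, lower, upper, rng):
    """Draw from N(mean, sigma^2) restricted to [lower, upper]; None means unbounded."""
    dist = NormalDist(mean, sigma)
    lo = 0.0 if lower is None else dist.cdf(lower)
    hi = 1.0 if upper is None else dist.cdf(upper)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < u < 1
    x = dist.inv_cdf(u)
    # Clamp so that rounding in the tails can never violate the ranking.
    if lower is not None:
        x = max(x, lower)
    if upper is not None:
        x = min(x, upper)
    return x

def gibbs_step(x, ranking, theta, sigma, rng):
    """Resample, in place, the utility of the alternative at a random position."""
    pos = rng.randrange(len(ranking))
    j = ranking[pos]
    upper = x[ranking[pos - 1]] if pos > 0 else None                  # next-better alternative
    lower = x[ranking[pos + 1]] if pos < len(ranking) - 1 else None   # next-worse alternative
    x[j] = sample_truncated_normal(theta[j], sigma, lower, upper, rng)
    return j

rng = random.Random(0)
ranking = [2, 0, 1]          # alternative 2 is most preferred
x = [0.0, -1.0, 1.0]         # initial utilities consistent with the ranking
for _ in range(5):
    gibbs_step(x, ranking, [0.0, 0.0, 0.0], 1.0, rng)
print(x)
```

Every step keeps the utilities consistent with the agent's ranking, so the chain stays inside the region defined by $\pi^i$.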



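Rao-Blackwellized: To further improve the Gibbs sampler, we use Rao-Blackwellized [4]
estimation using $E\{T(x^{i,k}_j) \mid x^{i,k}_{-j}, \pi^i, \vec{\theta}^t\}$ instead of the sample $x^{i,k}_j$, where $x^{i,k}_{-j}$ is all
of $\vec{x}^{i,k}$ except for $x^{i,k}_j$. Finally, we estimate $E\{T(x^{i,k}_j) \mid x^{i,k}_{-j}, \pi^i, \vec{\theta}^t\}$ in each step
of the Gibbs sampler using $M$ samples as

$$S^{i,t+1}_j \approx \frac{1}{N} \sum_{k=1}^{N} E\{T(x^{i,k}_j) \mid x^{i,k}_{-j}, \pi^i, \vec{\theta}^t\} \approx \frac{1}{NM} \sum_{k=1}^{N} \sum_{l=1}^{M} T(x^{i_l,k}_j),$$

where $x^{i_l,k}_j \sim \Pr(x^{i,k}_j \mid x^{i,k}_{-j}, \pi^i, \vec{\theta}^t)$. Rao-Blackwellization
reduces the variance of the estimator because of conditioning and expectation in
$E\{T(x^{i,k}_j) \mid x^{i,k}_{-j}, \pi^i, \vec{\theta}^t\}$.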
4.2 M-step



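In the E-step we have (approximately) computed $S^{i,t+1}_j$. In the M-step we compute $\vec{\theta}^{t+1}$ to maximize
$\sum_{i=1}^{n} \sum_{j=1}^{m} \left( \eta(\theta_j) E_{x^i_j}\{T(x^i_j) \mid \pi^i, \vec{\theta}^t\} - A(\theta_j) + E_{x^i_j}\{B(x^i_j) \mid \pi^i, \vec{\theta}^t\} \right)$. Equivalently, we
compute $\theta^{t+1}_j$ for each $j \le m$ separately to maximize
$\sum_{i=1}^{n} \{ \eta(\theta_j) E_{x^i_j}\{T(x^i_j) \mid \pi^i, \vec{\theta}^t\} - A(\theta_j) \} = \eta(\theta_j) \sum_{i=1}^{n} S^{i,t+1}_j - n A(\theta_j)$.

For the case of the normal distribution with fixed variance, where $\eta(\theta_j) = 2\theta_j$ and $A(\theta_j) = (\theta_j)^2$,
we have $\theta^{t+1}_j = \frac{1}{n} \sum_{i=1}^{n} S^{i,t+1}_j$.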



The algorithm is illustrated in Figure 3.

MC-EM
  Set $\vec{\theta}^{0}$
  While ($|\vec{\theta}^{t+1} - \vec{\theta}^{t}|$ is above a tolerance)
    MC E-Step: Rao-Blackwellized Gibbs-Sampler
    For (i = 1 : n in parallel)
      Approximate for j = 1 : m  $S^{i,t+1}_j = E\{T(x^i_j)\}$:
      For k = 1 : N
        j ~ Uniform(1 : m)
        $x^{i,k}_j$ ~ TruncatedEF($X_j = x^{i,k}_j \mid \theta^t_j, x^{i,k}_{-j}, \pi^i$)
        For l = 1 : M
          $x^{i_l,k}_j$ ~ TruncatedEF($X_j = x^{i,k}_j \mid \theta^t_j, x^{i,k}_{-j}, \pi^i$)
    M-Step
    For j = 1 : m
      $\theta^{t+1}_j = \frac{1}{n} \sum_{i=1}^{n} S^{i,t+1}_j$

Figure 3: The MC-EM algorithm for normal distribution.
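The loop in Figure 3 can be sketched end to end for the normal case with unit variance. This is a simplified illustration rather than the authors' implementation: it averages plain Gibbs samples instead of using the Rao-Blackwellized estimator, runs agents sequentially, uses a fixed number of EM iterations instead of a convergence test, and fixes the first parameter to 0 after each M-step. All function names, defaults, and the burn-in length are assumptions.

```python
import random
from statistics import NormalDist

def truncated_normal(mean, lower, upper, rng, sigma=1.0):
    """Inverse-CDF draw from N(mean, sigma^2) restricted to [lower, upper]."""
    dist = NormalDist(mean, sigma)
    lo = 0.0 if lower is None else dist.cdf(lower)
    hi = 1.0 if upper is None else dist.cdf(upper)
    u = min(max(lo + (hi - lo) * rng.random(), 1e-12), 1.0 - 1e-12)
    x = dist.inv_cdf(u)
    if lower is not None:
        x = max(x, lower)
    if upper is not None:
        x = min(x, upper)
    return x

def e_step_one_agent(ranking, theta, n_samples, burn_in, rng):
    """Monte Carlo estimate of E{x_j | ranking, theta} for every alternative j."""
    m = len(ranking)
    x = [0.0] * m
    for pos, j in enumerate(ranking):
        x[j] = float(m - pos)  # any starting utilities consistent with the ranking
    totals = [0.0] * m
    for k in range(burn_in + n_samples):
        pos = rng.randrange(m)
        j = ranking[pos]
        upper = x[ranking[pos - 1]] if pos > 0 else None
        lower = x[ranking[pos + 1]] if pos < m - 1 else None
        x[j] = truncated_normal(theta[j], lower, upper, rng)
        if k >= burn_in:
            for a in range(m):
                totals[a] += x[a]  # T(x) = x for the normal location model
    return [t / n_samples for t in totals]

def mc_em(profile, m, iterations=5, n_samples=100, burn_in=20, seed=0):
    """MC-EM for a normal RUM with unit variance; returns theta with theta[0] = 0."""
    rng = random.Random(seed)
    theta = [0.0] * m
    for _ in range(iterations):
        stats = [e_step_one_agent(r, theta, n_samples, burn_in, rng) for r in profile]
        theta = [sum(s[j] for s in stats) / len(profile) for j in range(m)]  # M-step
        theta = [t - theta[0] for t in theta]  # remove the shift invariance
    return theta

# Synthetic profile from a normal RUM with known means (illustrative values).
data_rng = random.Random(1)
true_theta = [0.0, 1.5, 3.0]
profile = []
for _ in range(100):
    u = [data_rng.gauss(t, 1.0) for t in true_theta]
    profile.append(sorted(range(3), key=lambda j: -u[j]))
print(mc_em(profile, 3, iterations=4))
```

Subtracting the first component after each M-step mirrors the normalization $\theta_1 = 0$ used in Theorem 2 and leaves the likelihood unchanged.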



4.3 Convergence 

In the last section we showed that if the RUM satisfies the premise in Theorem 1 and Theorem 2 and the
data satisfies Condition 1, then the log-likelihood function is concave, and the set of global maxima
solutions is bounded. This guarantees the convergence of MC-EM for an exact E-step.

In general, MC-EM methods do not have the uniform convergence property of EM methods. In
order to control the error of approximation in the MC E-step we can increase the number of samples
with the iterations [28]. However, in our application, we are not concerned with the exact estimation
of $\vec{\theta}$, as we are only interested in the order of its components relative to each other. Therefore, as long as the
approximation error remains relatively small, such that the differences between the $\theta_j$'s are much larger than
the error, we are safe to stop.

A known problem with Gibbs sampling is that it can introduce correlation among samples. To
address this, we sub-sample the samples to reduce the correlation, and call the ratio of sub-sampling
the thinning factor ($0 < \Gamma < 1$). A suitable thinning ratio can be set using empirical results from
the sampler.
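A minimal sketch of thinning follows, assuming the thinning factor is read as the fraction of samples retained and that a lag autocorrelation estimate is the empirical diagnostic used to choose it. Both function names and this reading of the factor are assumptions for illustration.

```python
def autocorrelation(chain, lag=1):
    """Sample autocorrelation of a scalar chain at the given lag (chain must vary)."""
    n = len(chain)
    mean = sum(chain) / n
    denom = sum((v - mean) ** 2 for v in chain)
    num = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return num / denom

def thin(chain, factor):
    """Keep roughly a fraction `factor` of the samples by taking every step-th one."""
    step = max(1, round(1.0 / factor))
    return chain[::step]

chain = [0.0, 0.1, 0.2, 0.3, 0.2, 0.1, 0.0, -0.1, -0.2, -0.1]
print(autocorrelation(chain, lag=1))
print(thin(chain, 0.5))
```

A practical choice is the largest factor for which the autocorrelation of the thinned chain is close to zero.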

With an approach similar to [3], we can derive a relationship between the variance of error in $\vec{\theta}^{t+1}$
and the Monte-Carlo error in the E-step approximation:

$$\mathrm{Var}(\theta^{t+1}_j) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(S^{i,t+1}_j) = \frac{1}{\Gamma M N n^2} \sum_{i=1}^{n} \mathrm{Var}(x^i_j) \le \frac{V}{\Gamma M N n} \qquad (8)$$

where $N$ is the number of samples in the Gibbs sampler, $M$ is the number of samples for Rao-Blackwellization,
$n$ is the number of agents, $\Gamma$ is the thinning factor and $V = \max_j(\mathrm{Var}_{x \sim \mu_j}(x))$,
and samples $x^i_j$ are assumed to be independent. Given $\Gamma$, $V$ and $n$, we can make $\mathrm{Var}(\theta^{t+1}_j)$
arbitrarily small by increasing $MN$.

5 Experimental Results 

We evaluate the proposed MC-EM algorithm on synthetic data as well as two real world data sets, 
namely an election data set and a dataset representing preference orders on sushi. For simulated data 
we use the Kendall correlation [11] between two rank orders (typically between the true order and
the method's result) as a measure of performance. 
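For two strict total orders over the same alternatives, the Kendall correlation can be computed by counting concordant and discordant pairs. The sketch below assumes this tie-free variant, since the text does not state which form of the coefficient is used; the function name is hypothetical.

```python
from itertools import combinations

def kendall_correlation(order_a, order_b):
    """Kendall correlation between two strict rankings of the same alternatives."""
    pos_a = {alt: r for r, alt in enumerate(order_a)}
    pos_b = {alt: r for r, alt in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # A pair is concordant when both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

print(kendall_correlation([0, 1, 2], [0, 2, 1]))
```

The value is 1 for identical orders and -1 for exactly reversed orders.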

5.1 Experiments for Synthetic Data 

We first generate data from Normal models for the random utility terms, with means $\theta_j = j$ and
equal variance for all terms, for different choices of variance (Var = 2, 4). We evaluate the performance
of the method as the number of agents $n$ varies. The results show that a limited number of
iterations in the EM algorithm (at most 3), and samples MN = 4000 (M = 5, N = 800) are sufficient
for inferring the order in most cases. The performance in terms of Kendall correlation for recovering
ground truth improves for a larger number of agents, which corresponds to more data. See Figure 4,
which shows the asymptotic behavior of the maximum likelihood estimator in recovering the true
parameters. The left and middle panels of Figure 4 show that the larger the dataset, the better the
performance of the method.

Moreover, for large variances in data generation, the added noise in the data means that
performance improves more slowly with $n$ than it does for smaller variances. Notice that the scales
on the y-axis are different in the left and middle panels.




Figure 4: Left and middle panel: Performance for different number of agents n on synthetic data for m = 5, 10
and Var = 2, 4, with specifications MN = 4000, EM iterations = 3. Right panel: Performance given
access to sub-samples of the data in the public election dataset, x-axis: size of sub-samples, y-axis: Kendall
correlation with the order obtained from the full dataset. Dashed lines are the 95% confidence intervals.



5.2 Experiments for Model Robustness 

We apply our method to a public election dataset collected by Nicolaus Tideman [27], where the
voters provided partial orders on candidates. A partial order includes comparisons among a subset
of alternatives, and the non-mentioned alternatives in the partial order are considered to be ranked
lower than the lowest ranked alternative among mentioned alternatives.

The total number of votes is n = 280 and the number of alternatives is m = 15. For the purpose of
our experiments, we adopt the order on alternatives obtained by applying our method on the entire
dataset as an assumed ground truth, since no ground truth is given as part of the data. After finding
the ground truth by using all 280 votes (and adopting a normal model), we compare the performance
of our approach as we vary the amount of data available. We evaluate the performance for sub-samples
consisting of 10, 20, ..., 280 samples randomly chosen from the full dataset. For each
sub-sample size, the experiment is repeated 200 times and we report the average performance and
the variance. See the right panel in Figure 4. This experiment shows the robustness of the method,
in the sense that the result of inference on a subset of the dataset is consistent with the
result on the full dataset. For example, the ranking obtained by using half of the data
can still achieve a fair estimate of the results with full data, with an average Kendall correlation of
greater than 0.4.

5.3 Experiments for Model Fitness 

In addition to a public election dataset, we have tested our algorithm on a sushi dataset, where 5000 
users give rankings over 10 different kinds of sushi p3| . For each experiment we randomly choose 
$n \in \{10, 20, 30, 40, 50\}$ rankings, and apply our MC-EM for RUMs with normal distributions where
variances are also parameters.

In the former experiments, both the synthetic data generation and the model for election data, the
variances were fixed to 1 and hence we had the theoretical guarantees for the convergence to global
optimal solutions by Theorem 1 and Theorem 2. When we let the variances be part of the parametrization
we lose the theoretical guarantees. However, the EM algorithm can still be applied, and since
the variances are now parameters (rather than being fixed to 1), the model fits better in terms of
log-likelihood.

For this reason, we adopt RUMs with normal distributions in which the variance is a parameter that
is fit by EM along with the mean. We call this model a normal model. We compute the difference
between the normal model and P-L in terms of four criteria: log-likelihood (LL), predictive log-likelihood
(predictive LL), AIC, and BIC. For (predictive) log-likelihood, a positive value means
that the normal model fits better than P-L, whereas for AIC and BIC, a negative number means that
the normal model fits better than P-L. Predictive likelihood is different from likelihood in the sense
that we compute the likelihood of the estimated parameters for a part of the data that is not used for
parameter estimation. In particular, we compute predictive likelihood for a randomly chosen subset
of 100 votes. The results and standard deviations for n = 10, 50 are summarized in Table 1.
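The two information criteria follow their standard definitions, AIC = 2k - 2 log L and BIC = k log N - 2 log L, where lower values indicate a better fit. The sketch below is generic: the parameter counts `k`, the sample size `n_obs`, and the log-likelihood values in the usage lines are placeholders, since the text does not state the exact counts used for the normal model and P-L.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n_obs):
    """Bayesian information criterion with n_obs observations."""
    return k * math.log(n_obs) - 2 * log_likelihood

# Hypothetical comparison of two fitted models (all numbers are placeholders).
ll_a, k_a = -120.0, 8
ll_b, k_b = -130.0, 4
print(aic(ll_a, k_a) - aic(ll_b, k_b))            # negative: model A preferred by AIC
print(bic(ll_a, k_a, 100) - bic(ll_b, k_b, 100))  # BIC penalizes extra parameters more
```

Because BIC charges log N per parameter instead of 2, a richer model can win on AIC while losing on BIC.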





                      n = 10                                          n = 50
Dataset    LL          Pred. LL       AIC          BIC         LL           Pred. LL     AIC            BIC
Sushi      8.8(4.2)    -56.1(89.5)    -7.6(8.4)    5.4(8.4)    22.6(6.3)    40.1(5.1)    -35.2(12.6)    -6.1(12.6)
Election   9.4(10.6)   91.3(103.8)    -8.8(21.2)   4.2(21.2)   44.8(15.8)   87.4(30.5)   -79.6(31.6)    -50.5(31.6)

Table 1: Model selection for the sushi dataset and election dataset. Cases where the normal model fits better
than P-L statistically with 95% confidence are in bold.



When n is small (n = 10), the variance is high and we are unable to obtain statistically significant
results in comparing fitness. When n is not too small (n = 50), RUMs with normal distributions
fit better than P-L. Specifically, for log-likelihood, predictive log-likelihood, and AIC, RUMs with
normal distributions outperform P-L with 95% confidence in both datasets.

5.4 Implementation and Run Time 

The running time for our MC-EM algorithm scales linearly with the number of agents on real-world
data (Election Data), with slope 13.3 seconds per agent on an Intel i5 2.70GHz PC. This is for 100
iterations of the EM algorithm, with the number of Gibbs samples increasing with iterations as 2000 + 300 *
iteration steps.